Robustness studies of black-box models are regarded as a necessary task, both for numerical models based on structural equations and for predictive models learned from data. Such studies must assess the model's robustness to possible misspecification of its inputs (e.g., covariate shift). The study of black-box models through the lens of uncertainty quantification (UQ) is often based on sensitivity analysis involving a probabilistic structure imposed on the inputs, whereas ML models are built solely from observed data. Our work aims to unify UQ and ML interpretability approaches by providing relevant, easy-to-use tools for both paradigms. To offer a generic and understandable framework for robustness studies, we define perturbations of the input information that rely on quantile constraints and on projections with respect to the Wasserstein distance between probability measures, while preserving the dependence structure of the inputs. We show that this perturbation problem can be solved analytically. Enforcing regularity constraints via isotonic polynomial approximations further leads to smoother perturbations, which may be more suitable in practice. Numerical experiments on real case studies from the UQ and ML fields highlight the computational feasibility of such studies and provide local and global insights into the robustness of black-box models to input perturbations.
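The abstract gives no implementation details, but the core idea of a quantile-based marginal perturbation that preserves the dependence structure can be sketched as follows. This is a minimal illustration, not the authors' exact Wasserstein projection: the `black_box` function, the toy data, and the mean-shift perturbation are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(x):
    # Stand-in for any trained model or simulator (hypothetical example).
    return x[:, 0] ** 2 + np.sin(x[:, 1]) + 0.5 * x[:, 0] * x[:, 1]

# Observed input sample (two correlated covariates).
n = 5_000
z = rng.normal(size=(n, 2))
x = np.column_stack([z[:, 0], 0.7 * z[:, 0] + 0.7 * z[:, 1]])

def perturb_marginal(x, j, new_quantile_fn):
    """Replace the marginal of column j by `new_quantile_fn` while keeping
    the ranks (hence the copula / dependence structure) unchanged."""
    x_pert = x.copy()
    ranks = x[:, j].argsort().argsort()          # empirical ranks in {0, ..., n-1}
    u = (ranks + 0.5) / len(x)                   # pseudo-observations in (0, 1)
    x_pert[:, j] = new_quantile_fn(u)            # push ranks through the new quantile function
    return x_pert

# Example perturbation: shift the first covariate's mean by +0.5
# (a simple Wasserstein-2 "translation" of that marginal).
baseline_q = lambda u: np.quantile(x[:, 0], u)
shifted_q = lambda u: baseline_q(u) + 0.5
x_pert = perturb_marginal(x, j=0, new_quantile_fn=shifted_q)

y_ref, y_pert = black_box(x), black_box(x_pert)
print("mean output shift:", y_pert.mean() - y_ref.mean())
print("output std ratio :", y_pert.std() / y_ref.std())
```

Comparing the reference and perturbed output distributions then gives a crude, global robustness reading in the spirit of the study described above.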
In this paper, we present a new explainability formalism designed to shed light on how each input variable of a test set influences the predictions of machine learning models. We thus propose a group-wise explainability formalism for trained machine learning decision rules, based on their response to variability in the distribution of the input variables. To emphasize the impact of each input variable, this formalism relies on an information-theoretic framework that quantifies the influence of all input-output observations through entropic projections. It is therefore the first unified and model-agnostic formalism enabling data scientists to interpret the dependence between input variables, their impact on prediction errors, and their influence on the output predictions. Convergence rates of the entropic projections are provided in the large-sample case. Most importantly, we prove that computing an explanation in our framework has low algorithmic complexity, making it scalable to real-life large data sets. We illustrate our strategy by explaining complex decision rules learned with XGBoost, random forest, or deep neural network classifiers on various data sets (Adult Income, MNIST, CelebA, Boston Housing, Iris, as well as synthetic ones). Finally, we clarify how our approach differs from the explainability strategies LIME and SHAP, which are based on single observations. Results can be reproduced using the freely distributed Python toolbox https://gems-ai.aniti.fr/.
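The abstract centers on entropic projections of the observed sample. A minimal sketch of that underlying idea is to tilt the observation weights exponentially (the minimum-KL solution under a mean constraint) so that one input variable's distribution shifts, then re-read the model's error rate under the new weights. The toy data and decision rule below are hypothetical, and this single-constraint tilting is only an illustration of the mechanism, not the toolbox's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy test set and a toy "trained" decision rule (both hypothetical).
n = 4_000
x = rng.normal(size=(n, 3))
y = (x[:, 0] + 0.5 * x[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(int)
predict = lambda x: (x[:, 0] + 0.5 * x[:, 1] > 0).astype(int)

def entropic_weights(v, target_mean, n_iter=100):
    """Weights w_i proportional to exp(lam * v_i) (entropic / minimum-KL tilting),
    with lam chosen so that the weighted mean of v equals `target_mean`."""
    lam = 0.0
    for _ in range(n_iter):                      # Newton iterations on the dual
        w = np.exp(lam * v)
        w /= w.sum()
        m = w @ v
        var = w @ (v - m) ** 2
        lam += (target_mean - m) / max(var, 1e-12)
    return w

# Question asked of the black box: what happens to the error rate if the
# distribution of variable 0 is shifted so that its mean increases by 1?
w = entropic_weights(x[:, 0], target_mean=x[:, 0].mean() + 1.0)
err_ref = np.mean(predict(x) != y)
err_shift = np.sum(w * (predict(x) != y))
print(f"error rate: {err_ref:.3f} -> {err_shift:.3f} under the shifted input distribution")
```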
Mixtures of von Mises-Fisher distributions can be used to cluster data on the unit hypersphere. This is particularly well suited to high-dimensional directional data such as texts. In this article, we propose to estimate a von Mises-Fisher mixture using an l1-penalized likelihood. This leads to sparse prototypes that improve clustering interpretability. We introduce an expectation-maximisation (EM) algorithm for this estimation and explore the trade-off between the sparsity term and the likelihood term with a path-following algorithm. The model's behaviour is studied on simulated data, and we show the advantages of the approach on real-data benchmarks. We also introduce a new data set of financial reports and exhibit the benefits of our method for exploratory analysis.
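A minimal sketch of this kind of EM is given below, with two simplifications for brevity: a shared, fixed concentration parameter and a soft-thresholding step standing in for the l1 penalty on the prototypes. It is not the paper's estimator and does not include its path-following algorithm; all parameter values are illustrative assumptions.

```python
import numpy as np

def sparse_spherical_em(x, k, kappa=20.0, lam=0.05, n_iter=50, seed=0):
    """Illustrative EM for a von Mises-Fisher-like mixture with a shared, fixed
    concentration `kappa` and a soft-thresholding step yielding sparse prototypes."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    x = x / np.linalg.norm(x, axis=1, keepdims=True)        # project data on the sphere
    mu = x[rng.choice(n, k, replace=False)]                  # random initial prototypes
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E step: responsibilities under vMF densities with common kappa
        log_r = kappa * (x @ mu.T) + np.log(pi)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M step: weighted mean directions, soft-thresholded then renormalized
        pi = r.mean(axis=0)
        m = r.T @ x
        m = m / np.maximum(np.abs(m).max(axis=1, keepdims=True), 1e-12)
        m = np.sign(m) * np.maximum(np.abs(m) - lam, 0.0)    # sparsity-inducing shrinkage
        norms = np.linalg.norm(m, axis=1, keepdims=True)
        mu = np.where(norms > 0, m / np.maximum(norms, 1e-12), mu)
    return mu, pi, r

# Toy directional data: two noisy directions in dimension 20.
rng = np.random.default_rng(1)
base = np.zeros((2, 20)); base[0, :3] = 1.0; base[1, 5:8] = 1.0
x = np.vstack([b + 0.1 * rng.normal(size=(200, 20)) for b in base])
mu, pi, r = sparse_spherical_em(x, k=2)
print("non-zero coordinates per prototype:", (np.abs(mu) > 0).sum(axis=1))
```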
G-Enum histograms are a new, fast, and fully automated method for irregular histogram construction. By framing histogram construction as a density estimation problem and its automation as a model selection task, these histograms leverage the Minimum Description Length (MDL) principle to derive two different model selection criteria. Several proven theoretical results about these criteria give insight into their asymptotic behavior and are used to speed up their optimisation. These insights, combined with a greedy search heuristic, are used to construct histograms in linearithmic time rather than the polynomial time incurred by previous works. The capabilities of the proposed MDL density estimation method are illustrated against other fully automated methods in the literature, on both synthetic and large real-world data sets.
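To make the setup concrete, here is a naive greedy construction of an irregular histogram driven by a penalized log-likelihood score. The BIC-style per-bin penalty is a crude stand-in for the G-Enum MDL criteria, and this plain quadratic-time search does not reproduce the paper's linearithmic algorithm; it only illustrates the "density estimation + model selection + greedy search" framing.

```python
import numpy as np

def greedy_penalized_histogram(x, n_initial_bins=64):
    """Illustrative greedy irregular histogram: start from fine equal-frequency
    bins and drop interior cut points while a penalized score improves."""
    x = np.sort(x)
    n = len(x)
    edges = list(np.unique(np.quantile(x, np.linspace(0, 1, n_initial_bins + 1))))

    def score(edges):
        counts, _ = np.histogram(x, bins=edges)
        widths = np.diff(edges)
        mask = counts > 0
        loglik = np.sum(counts[mask] * np.log(counts[mask] / (n * widths[mask])))
        penalty = 0.5 * len(counts) * np.log(n)   # BIC-style per-bin cost (assumption)
        return -loglik + penalty

    best = score(edges)
    improved = True
    while improved and len(edges) > 2:
        improved = False
        for i in range(1, len(edges) - 1):
            candidate = edges[:i] + edges[i + 1:]  # merge the two bins around edge i
            s = score(candidate)
            if s < best:
                best, edges, improved = s, candidate, True
                break
    return np.array(edges)

rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 0.3, 2_000), rng.normal(1, 1.0, 3_000)])
edges = greedy_penalized_histogram(sample)
print(len(edges) - 1, "bins:", np.round(edges, 2))
```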
This paper presents an introduction to the state of the art in anomaly and change-point detection. On the one hand, the main concepts needed to understand the vast scientific literature on these subjects are introduced. On the other hand, a selection of important surveys and books, as well as two active research topics in the field, are presented.
Machine learning (ML) models are nowadays used in complex applications in various domains, such as medicine, bioinformatics, and other sciences. Due to their black box nature, however, it may sometimes be hard to understand and trust the results they provide. This has increased the demand for reliable visualization tools related to enhancing trust in ML models, which has become a prominent topic of research in the visualization community over the past decades. To provide an overview and present the frontiers of current research on the topic, we present a State-of-the-Art Report (STAR) on enhancing trust in ML models with the use of interactive visualization. We define and describe the background of the topic, introduce a categorization for visualization techniques that aim to accomplish this goal, and discuss insights and opportunities for future research directions. Among our contributions is a categorization of trust against different facets of interactive ML, expanded and improved from previous research. Our results are investigated from different analytical perspectives: (a) providing a statistical overview, (b) summarizing key findings, (c) performing topic analyses, and (d) exploring the data sets used in the individual papers, all with the support of an interactive web-based survey browser. We intend this survey to be beneficial for visualization researchers whose interests involve making ML models more trustworthy, as well as researchers and practitioners from other disciplines in their search for effective visualization techniques suitable for solving their tasks with confidence and conveying meaning to their data.
In recent years, the applications of machine learning models have increased rapidly, due to the large amount of available data and technological progress. While some domains like web analysis can benefit from this with only minor restrictions, other fields, like medicine with its patient data, are more strongly regulated. In particular, \emph{data privacy} plays an important role, as recently highlighted by the trustworthy AI initiative of the EU and by general privacy regulations in legislation. Another major challenge is that the required training \emph{data} is often \emph{distributed} in terms of features or samples and unavailable for classical batch learning approaches. In 2016, Google introduced a framework called \emph{Federated Learning} to solve both of these problems. We provide a brief overview of existing methods and applications in the fields of vertical and horizontal \emph{Federated Learning}, as well as \emph{Federated Transfer Learning}.
Co-clustering is a class of unsupervised data analysis techniques that extract the underlying dependency structure between the instances and variables of a data table in the form of homogeneous blocks. Most of these techniques are limited to variables of the same type. In this paper, we propose a mixed-data co-clustering method based on a two-step methodology. In the first step, all variables are binarized according to a number of bins chosen by the analyst: by equal-frequency discretization in the numerical case, or by keeping the most frequent values in the categorical case. The second step applies a co-clustering to the instances and the binary variables, leading to groups of instances and groups of variable parts. We apply this methodology to several data sets and compare the results with those of a Multiple Correspondence Analysis applied to the same data.
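The binarization step described above is simple enough to sketch directly. The snippet below shows only that first step (the subsequent co-clustering is not reproduced); the column names and the number of bins are hypothetical, and pandas is assumed to be available.

```python
import numpy as np
import pandas as pd

def binarize_mixed(df, n_bins=4):
    """Illustrative first step of the described methodology: turn every variable
    into binary indicator columns -- equal-frequency bins for numerical variables,
    the `n_bins` most frequent values for categorical ones."""
    parts = []
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            binned = pd.qcut(s, q=n_bins, duplicates="drop")
            parts.append(pd.get_dummies(binned, prefix=col))
        else:
            top = s.value_counts().index[:n_bins]
            kept = s.where(s.isin(top), other="__other__")
            parts.append(pd.get_dummies(kept, prefix=col))
    return pd.concat(parts, axis=1).astype(int)

# Toy mixed data set (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 90, 500),
    "income": rng.lognormal(10, 1, 500),
    "city": rng.choice(["paris", "lyon", "nice", "metz", "pau"], 500),
})
binary = binarize_mixed(df, n_bins=3)
print(binary.shape)              # instances x binary variable parts, ready for co-clustering
print(binary.columns.tolist())
```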
Co-clustering is a data mining technique used to extract the underlying block structure between the rows and columns of a data matrix. Many approaches have been studied and have shown their ability to extract such structures from continuous, binary, or contingency tables. However, very little work has been done on co-clustering mixed-type data. In this article, we extend latent block model-based co-clustering to the case of mixed data (continuous and binary variables). We then evaluate the effectiveness of the proposed approach on simulated data and discuss its advantages and potential limitations.
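For readers unfamiliar with latent block models, a crude classification-EM on a purely binary matrix conveys what such a block structure looks like. This hard-assignment variant is only a baseline sketch; the article works with the soft, variational latent block model and extends it to mixed continuous/binary variables, which is not reproduced here.

```python
import numpy as np

def binary_block_cem(X, K, L, n_iter=20, seed=0):
    """Crude classification-EM for a Bernoulli block structure: alternately
    reassign rows and columns to the group maximizing their log-likelihood."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z = rng.integers(K, size=n)          # row cluster labels
    w = rng.integers(L, size=d)          # column cluster labels
    eps = 1e-6
    for _ in range(n_iter):
        # block parameters: probability of a 1 in each (row group, column group) block
        alpha = np.array([[X[np.ix_(z == k, w == l)].mean()
                           if (z == k).any() and (w == l).any() else 0.5
                           for l in range(L)] for k in range(K)])
        alpha = np.clip(alpha, eps, 1 - eps)
        # reassign rows given the column labels
        ones_per_colgroup = np.array([X[:, w == l].sum(axis=1) for l in range(L)]).T   # n x L
        cols_per_group = np.array([(w == l).sum() for l in range(L)])                  # L
        row_ll = ones_per_colgroup @ np.log(alpha).T \
               + (cols_per_group - ones_per_colgroup) @ np.log(1 - alpha).T
        z = row_ll.argmax(axis=1)
        # reassign columns given the row labels
        ones_per_rowgroup = np.array([X[z == k, :].sum(axis=0) for k in range(K)]).T   # d x K
        rows_per_group = np.array([(z == k).sum() for k in range(K)])                  # K
        col_ll = ones_per_rowgroup @ np.log(alpha) \
               + (rows_per_group - ones_per_rowgroup) @ np.log(1 - alpha)
        w = col_ll.argmax(axis=1)
    return z, w, alpha

# Toy binary matrix with a planted 2 x 2 block structure.
rng = np.random.default_rng(1)
blocks = np.array([[0.9, 0.1], [0.2, 0.8]])
z_true, w_true = rng.integers(2, size=200), rng.integers(2, size=60)
X = rng.binomial(1, blocks[np.ix_(z_true, w_true)])
z, w, alpha = binary_block_cem(X, K=2, L=2)
print(np.round(alpha, 2))        # recovered block probabilities
```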
In the upcoming years, artificial intelligence (AI) is going to transform the practice of medicine in most of its specialties. Deep learning can help achieve better and earlier problem detection, while reducing diagnostic errors. By feeding a deep neural network (DNN) with the data from a low-cost, low-accuracy sensor array, we demonstrate that it becomes possible to significantly improve the measurements' precision and accuracy. The data collection is done with an array composed of 32 temperature sensors, including 16 analog and 16 digital sensors. All sensors have accuracies between 0.5 and 2.0$^\circ$C. 800 vectors are extracted, covering a range from 30 to 45$^\circ$C. In order to improve the temperature readings, we use machine learning to perform a regression analysis through a DNN. In an attempt to minimize the model's complexity so that inferences can eventually be run locally, the network with the best results involves only three layers, using the hyperbolic tangent activation function and the Adam stochastic gradient descent (SGD) optimizer. The model is trained on a randomly selected dataset of 640 vectors (80% of the data) and tested on 160 vectors (20%). Using the mean squared error as the loss function between the data and the model's prediction, we achieve a loss of only 1.47x10$^{-4}$ on the training set and 1.22x10$^{-4}$ on the test set. As such, we believe this appealing approach offers a new pathway towards significantly better datasets using readily available ultra-low-cost sensors.
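A model of the kind described (three layers, tanh activations, Adam, MSE loss, 640/160 split) can be sketched as follows. The sensor data here is simulated since the real data set is not reproduced in the abstract, the layer widths are assumptions the abstract does not give, and the input/target standardization is an added preprocessing assumption; PyTorch is used for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in for the 800 x 32 sensor readings and the reference temperature.
n, n_sensors = 800, 32
true_temp = torch.rand(n, 1) * 15 + 30                   # 30-45 degC reference
readings = true_temp + torch.randn(n, n_sensors)         # noisy low-cost sensor readings

# Standardize inputs and targets (added assumption; keeps the tanh units unsaturated).
x = (readings - readings.mean(0)) / readings.std(0)
y = (true_temp - true_temp.mean()) / true_temp.std()

split = int(0.8 * n)                                     # 640 train / 160 test
x_train, y_train, x_test, y_test = x[:split], y[:split], x[split:], y[split:]

# Three-layer network with tanh activations and the Adam optimizer, as in the abstract.
model = nn.Sequential(
    nn.Linear(n_sensors, 16), nn.Tanh(),
    nn.Linear(16, 8), nn.Tanh(),
    nn.Linear(8, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    print("train MSE (standardized units):", loss_fn(model(x_train), y_train).item())
    print("test  MSE (standardized units):", loss_fn(model(x_test), y_test).item())
```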